Preface

In this assignment we will exercise some of the unsupervised learning approaches on 2016 Global Health Observatory data. It is available at that website in the form of Excel file, but its cleaned up version ready for import into R for further analyses is available at CSCI E-63C canvas course web site whs2016_AnnexB-data-wo-NAs.txt. The cleaning and reformatting included: merging data from the two parts of Annex B, reducing column headers to one line with short tags, removal of “>”, “<” and whitespaces, conversion to numeric format and replacement of undefined values (as indicated by en-dash’es in the Excel) with corresponding averages of those attributes. The code that was used to format merged data is shown at the end of this document for your reference only. You are advised to save yourself that trouble and start from preformatted text file available at the course website as shown above. The explicit mapping of short variable names to their full description as provided in the original file is available in Excel file whs2016_AnnexB-reformatted.xls also available on the course canvas page. Lastly, you are advised to download a local copy of this text file to your computer and access it there (as opposed to relying on R ability to establish URL connection to canvas that potentially requires login etc.)

Short example of code shown below illustrates reading this data from a local copy on your computer (assuming it has been copied into current working directory of your R session – getwd() and setwd() commands are helpful to find out what is it currently and change it to desired location) and displaying summaries and pairs plot of five (out of almost 40) arbitrary chosen variables. This is done for illustration purposes only – the problems in the assignment expect use of all variables in this dataset.

whsAnnBdatNum <- read.table("whs2016_AnnexB-data-wo-NAs.txt",sep="\t",header=TRUE,quote="")
summary(whsAnnBdatNum[,c(1,4,7,10,17)])
##      TOTPOP          LIFEXPB.B       MEDBIRTHS          NEWHIV      
##  Min.   :      2   Min.   :50.10   Min.   :  9.00   Min.   : 0.100  
##  1st Qu.:   1876   1st Qu.:66.03   1st Qu.: 78.50   1st Qu.: 0.300  
##  Median :   8070   Median :72.50   Median : 96.00   Median : 1.563  
##  Mean   :  37696   Mean   :71.29   Mean   : 84.28   Mean   : 1.563  
##  3rd Qu.:  26413   3rd Qu.:76.38   3rd Qu.: 99.00   3rd Qu.: 1.563  
##  Max.   :1383925   Max.   :83.70   Max.   :100.00   Max.   :20.100  
##      ALCONS      
##  Min.   : 0.000  
##  1st Qu.: 2.900  
##  Median : 6.085  
##  Mean   : 6.085  
##  3rd Qu.: 8.700  
##  Max.   :17.400
pairs(whsAnnBdatNum[,c(1,4,7,10,17)])

In some way this dataset is somewhat similar to the USArrests dataset extensively used in ISLR labs and exercises – it collects various continuous statistics characterizing human population across different territories. It is several folds larger though – instead of 50 US states and 4 attributes in USArrests, world health statistics (WHS) data characterizes 194 WHO member states by 37 variables. Have fun!

The following problems are largely modeled after labs and exercises from Chapter 10 ISLR. If anything presents a challenge, besides asking questions on piazza (that is always a good idea!), you are also encouraged to review corresponding lab sections in ISLR Chapter 10.

Problem 1: Principal components analysis (PCA) (25 points)

Sub-problem 1a: means and variances of WHS attributes (5 points)

Compare means and variances of the attributes in the world health statisics dataset – plot of variance vs. mean is probably the best given number of attributes in the dataset. Function apply allows to apply desired function (e.g. mean or var) to each row or column in the table. Do you see all 37 attributes in the plot, or at least most of them? (Remember that you can use plot(inpX,inpY,log="xy") to use log-scale on both horizontal and vertical axes.) What is the range of means and variances when calculated on untransformed data? Which are the top two attributes with highest mean or variance? What are the implications for PCA rendition of this dataset (in two dimensions) if applied to untransformed data?

#Thanks for cleaning the data!
# Reading in the countries
countries <- row.names(whsAnnBdatNum)

#Means of GHO statistics
gho.mean <- apply(whsAnnBdatNum,2,mean)
cat("Mean of untransformed values is:", range(gho.mean))
## Mean of untransformed values is: 0.2245902 9817345
#Variances of GHO statistics
gho.var <- apply(whsAnnBdatNum,2,var)
cat("Variance of untransformed values is:", range(gho.var))
## Variance of untransformed values is: 0.4330536 1.958163e+15
#Sorting the data before plotting
gho.data <- data.frame(gho.mean, gho.var)
gho.data <-gho.data[order(gho.mean),]

plot(gho.data$gho.mean, gho.data$gho.var, log="xy", xlab="log(mean) GHO statistics",ylab="log(var) GHO statisics")

#Top 2 Mean
cat("The 2 variables with the highest mean are:",names(gho.mean)[which(gho.mean %in% tail(sort(gho.mean), 2))])
## The 2 variables with the highest mean are: TOTPOP INTINTDS
#Top 2 Variance
cat("The 2 variables with the highest variance are:",names(gho.var)[which(gho.var %in% tail(sort(gho.var), 2))])
## The 2 variables with the highest variance are: TOTPOP INTINTDS
#
# If we try using PCA on the untransformed data, given the different units used in the reporting, we will have a biplot that where the signal from the largest variables drown out the others

Sub-problem 1b: PCA on untransformed data (10 points)

Perform PCA on untransformed data in WHS dataset (remember, you can use R function prcomp for that). Generate scree plot of PCA results (by calling plot on the result of prcomp) and plot of the two first principal components using biplot. Which variables seem to predominantly drive the results of PCA when applied to untransformed data?

#Performing PCA on untransformed data
pc.out <- prcomp(whsAnnBdatNum)

#Scree plot of Principal Components
plot(pc.out)

biplot(pc.out)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

#From the scree plot we can see that the first Principal Component 
#accounts for pretty much the entire variance.
#From the Biplot, it seems that INTINTDS, the variable that had by far the highest mean and Variance is driving this
#

Please note that in this case you should expect biplot to generate substantial number of warnings. Usually in R we should pay attention to these and understand whether they indicate that something went wrong in our analyses. In this particular case they are expected – why do you think that is?

#As a guess, I think this is due to the principal components being too small to plot on the same scales

The field rotation in the output of prcomp contains loadings of the 1st, 2nd, etc. principal components (PCs) – that can interpreted as contributions of each of the attributes in the input data to each of the PCs. What attributes have the largest (by their absolute value) loadings for the first and second principal component? How does it compare to what you have observed when comparing means and variances of all attributes in the world health statistics dataset?

pc.rot <- pc.out$rotation

#Get first 2 principal components
pc.rot2 <- pc.rot[,1:2]

#Variable with largest loading 
row.names(pc.rot2)[which.max(abs(pc.rot2))]
## [1] "INTINTDS"
#This is consistent with what we expected, as PCA creates components based on the highest variance, and INTINTDS had by far the largest variance.

Calculate percentage of variance explained (PVE) by the first five principal components (PCs). You can find an example of doing this in ISLR Chapter 10.4 (Lab 1 on PCA).

# Getting the variances from the squared SD for the first 5 principal compenents
pr.var <- pc.out$sdev[1:5] ^2

# Percent variance explained is the proportion of the variance from each Principal component relative to the total variance
pve <- pr.var/sum(pr.var)

Lastly, perform PCA on transposed (but still untransformed) WHS dataset – remember that in R t(x) returns transpose of x:

matrix(1:6,ncol=3)
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
t(matrix(1:6,ncol=3))
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6
pc.tr <- prcomp(t(whsAnnBdatNum))

Present results of PCA on transposed world health statistics dataset in the form of scree and biplot, describe the results.

plot(pc.tr)

biplot(pc.tr)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

Sub-problem 1c: PCA on scaled WHS data (10 points)

Perform PCA on scaled world health statistics data. To do that you can either use as input to prcomp the output ofscale as applied to the WHS data matrix or call prcomp with parameter scale set to TRUE. Present results of PCA in the form of scree plot and plot of the first two principal components. How do they compare to those generated on the results of PCA of untransformed data? What dataset attributes contribute the most (by absolute value) to the top two PCs? What are the signs of those contributions? How would you interpret that?

#Scaline prior to PCA
pc.scale <- prcomp(whsAnnBdatNum,scale=TRUE)


plot(pc.scale)

biplot(pc.scale, cex=.5)

The output of biplot with almost 200 text labels on it is pretty busy and could be tough to read. You can achieve better control when plotting PCA results if instead you plot the first two columns of the x attribute in the output of prcomp – e.g. plot(prcomp(USArrests,scale=T)$x[,1:2]). Use this to label a subset of countries on the plot – you can use text function in R to add labels at specified positions on the plot – please feel free to choose several countries of your preference and discuss the results. Alternatively, indicate US, UK, China, India, Mexico, Australia, Israel, Italy, Ireland and Sweden and discuss the results. Where do the countries you have plotted fall in the graph? Considering what you found out about contributions of different attributes to the first two PCs, what do their positions tell us about their (dis-)similarities in terms of associated health statistics?

#countries of interest
countries <- c("UnitedStatesofAmerica" , "UnitedKingdom" , "China", "India", "Mexico", "Australia", "Israel", "Italy", "Ireland","Sweden")

#Plot of principal components
plot(pc.scale$x[,1:2])

#Subset for countries of interest
pc.count <- pc.scale$x[row.names(pc.scale$x) %in% countries,1:2]

#Highlighted countries
points(pc.count,col="red",pch=19)

#Labelled countries of interest
text(pc.count,labels=row.names(pc.count),cex=.8,col="blue")

Finally, perform PCA on transposed scaled WHS dataset – present results in the form of scree plot and biplot and discuss these presentations.

tr <- t(whsAnnBdatNum)
pca.tran <- prcomp(tr, scale=TRUE)

plot(pca.tran)

biplot(pca.tran, cex=.5)

#After transposing the countries are now the variables, and the stats are the observations
#The scaling is done on the variables and not on the obeservation, so we still have a plot resembling the unscaled plot

tr.means <- apply(tr,1,mean)
tr.vars <- apply(tr,1,var)

range(tr.means)
## [1] 2.245902e-01 9.817345e+06
range(tr.vars)
## [1] 4.330536e-01 1.958163e+15

For extra 8 points

Try the following:

  • Instead of scaling (or prior to scaling) perform log-transform of the data before passing it on to prcomp. Given that some of the attributes have zero values, you will have to decide how to handle those to avoid negative infinity upon log-transformation. Usually, a small positive (often a fraction of smallest non-zero value for that attribute) value is added to all (or just zero) values of the attribute that contains zeroes. Present and describe the results.
  • Demonstrate equivalence of the results as obtained by prcomp(x) and cmdscale(dist(x)) where x represents scaled WHS dataset.
  • Explore use of multidimensional scaling (MDS) tools available in library MASS such as sammon and isoMDS. Present their results and discuss the differences between them and PCA output. No, there was nothing on that in the lecture – thus it is for extra points and to broaden your horizons.
log.data <- log(whsAnnBdatNum+1)
pr.log <- prcomp(log.data, scale = TRUE)
plot(pr.log)

sc.log <- cmdscale(dist(scale(log.data)))
sc.norm <- cmdscale(dist(whsAnnBdatNum))
apply(sc.log,2,range)
##           [,1]      [,2]
## [1,] -6.477745 -4.047202
## [2,]  9.572214  3.750603
apply(sc.norm,2,range)
##           [,1]       [,2]
## [1,]  -9817405 -1309848.0
## [2,] 567424756   144918.8
#As suspected, the range of the values in the log transformed data is significantly smaller that the range of the normal values

library(MASS)

Problem 2: K-means clustering (15 points)

Sub-problem 2a: k-means clusters of different size (5 points)

Using function kmeans perform K-means clustering on explicitly scaled (e.g. kmeans(scale(x),2)) world health statistics data for 2, 3 and 4 clusters. Use cluster attribute in the output of kmeans to indicate cluster membership by color and/or shape of the corresponding symbols in the plot of the first two principal components generated independently on the same (scaled WHS) data. E.g. plot(prcomp(xyz)$x[,1:2],col=kmeans(xyz,4)$cluster) where xyz is input data. Describe the results. Which countries are clustered together for each of these choices of \(K\)?

for (k in 2:4){
   km = kmeans(pc.scale$x[,1:2],k)
   plot(pc.scale$x[,1:2],col=km$cluster+1)
   
   #Labelled countries of interest
  text(pc.scale$x[,1:2],labels=row.names(pc.scale$x),cex=.5,col="black")

}

#When K=2, countries seem roughly divided on 1st and 3rd world nations/northern vs southern
#When K=3, the countries in blue are generally wealthy northern countries, in green are generally the developing southern countries, while the red is the African subcontinent countries
#When K=4, we get a further split of the mostly African and Middle Eastern countries with fairly recent large scale instability such as war.

Sub-problem 2b: variability of k-means clustering (5 points)

By default, k-means clustering uses random set of centers as initial guesses of cluster centers. Here we will explore variability of k-means cluster membership across several such initial random guesses. To make such choices of random centers reproducible, we will use function set.seed to reset random number generator (RNG) used in R to make those initial guesses to known/controlled initial state.

Using the approach defined above, repeat k-means clustering with four clusters three times resetting RNG each time with set.seed using seeds of 1, 2 and 3 respectively. Indicate cluster membership in each of these three trials on the plot of the first two principal components using color and/or shape as described above. Two fields in the output of kmeanstot.withinss and betweenss – characterize within and between clusters sum-of-squares. Tighter clustering results are those which have smaller ratio of within to between sum-of-squares. What are the resulting ratios of within to between sum-of-squares for each of these three k-means clustering results (with random seeds of 1, 2 and 3)?

Please bear in mind that the actual cluster identity is assigned randomly and does not matter – i.e. if cluster 1 from the first run of kmeans (with random seed of 1) and cluster 4 from the run with the random seed of 2 contain the same observations (country/states in case of WHS dataset), they are the same clusters.

kmeans4 <- function(upRange,...){
  for (i in 1:upRange){
    set.seed(i)
    km4 <- kmeans(pc.scale$x[,1:2],4,...)
    plot(pc.scale$x[,1:2],col=km4$cluster+1)
    cat("Ratio of within and between cluster sum of squares seed=",i," : " ,km4$tot.withinss/km4$betweenss,"\n")
  }
}

kmeans4(3)

## Ratio of within and between cluster sum of squares seed= 1  :  0.1796797

## Ratio of within and between cluster sum of squares seed= 2  :  0.1900712

## Ratio of within and between cluster sum of squares seed= 3  :  0.1796797

Sub-problem 2c: effect of nstarts parameter (5 points)

Repeat the procedure implemented for the previous sub-problem (k-means with four clusters for RNG seeds of 1, 2 and 3) now using 100 as nstart parameter in the call to kmeans. Represent results graphically as before. How does cluster membership compare between those three runs now? What is the ratio of within to between sum-of-squares in each of these three cases? What is the impact of using higher than 1 (default) value of nstart? What is the ISLR recommendation on this offered in Ch. 10.5.1?

kmeans4(3,nstart=100)

## Ratio of within and between cluster sum of squares seed= 1  :  0.1796797

## Ratio of within and between cluster sum of squares seed= 2  :  0.1796797

## Ratio of within and between cluster sum of squares seed= 3  :  0.1796797
#The groups created by the clustering are now identical for these three rns
#We can also see that the ratio of within to between sum-of-squares is now identical across the runs. 


#The advantage of using a higher nstart is that we can compare more starting position combinations and are more likely to find the global minima and not a local minima. A value of nstart 20 to 50 is highly recommended in ISLR

For extra 8 points

Try the following:

  • evaluate dependency between the stability of k-means clustering and the number of clusters and values of nstarts; to make this more quantitative consider using contingency table (i.e. table) to quantify concordance of two different clustering results (E.g. how many non-zero cells would be in the output of table for two perfectly concordant clustering assignments?)
  • Try using silhouette from the library cluster as another tool for assessing cluster strength for some of the clusters obtained here and describe the results
library(cluster)

set.seed(1)
kvals <- c(2,4,6,8,16)
nvals <- c(5,10,20,50)

#Effect of varying K
km2 <- kmeans(pc.scale$x[,1:2],2)
km4 <- kmeans(pc.scale$x[,1:2],4)
km6 <- kmeans(pc.scale$x[,1:2],6)
km8 <- kmeans(pc.scale$x[,1:2],8)
km16 <- kmeans(pc.scale$x[,1:2],16)

table(km2$cluster)
## 
##   1   2 
## 135  59
table(km4$cluster)
## 
##  1  2  3  4 
## 54 36 38 66
table(km6$cluster)
## 
##  1  2  3  4  5  6 
## 20 32 36 33 37 36
table(km8$cluster)
## 
##  1  2  3  4  5  6  7  8 
## 39  8 17 22 40 34 15 19
table(km16$cluster)
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 
## 14  6 10  8 17 13 18 15 32  8 16 13  3  6  7  8
max(table(km2$cluster))
## [1] 135
max(table(km4$cluster))
## [1] 66
max(table(km6$cluster))
## [1] 37
max(table(km8$cluster))
## [1] 40
max(table(km16$cluster))
## [1] 32
#Initially when the number of clusters are low, we can see that increasing the number of clusters, splits the largest clusters into subclusters, however, after a point, there seems some oscillation on the max cluster size and not necessarily decreasing like I would expect


set.seed(1)
#Effect of varying nstart
ns1  <- kmeans(pc.scale$x[,1:2],4)
ns5  <- kmeans(pc.scale$x[,1:2],4,nstart=5)
ns10 <- kmeans(pc.scale$x[,1:2],4,nstart=10)
ns20 <- kmeans(pc.scale$x[,1:2],4,nstart=20)
ns50 <- kmeans(pc.scale$x[,1:2],4,nstart=50)
ns100 <- kmeans(pc.scale$x[,1:2],4,nstart=100)
ns200 <- kmeans(pc.scale$x[,1:2],4,nstart=200)

sort(table(ns1$cluster))
## 
##  2  4  3  1 
## 20 39 54 81
sort(table(ns5$cluster))
## 
##  3  2  4  1 
## 36 38 54 66
sort(table(ns10$cluster))
## 
##  2  4  1  3 
## 20 39 54 81
sort(table(ns20$cluster))
## 
##  1  2  4  3 
## 20 39 54 81
sort(table(ns50$cluster))
## 
##  1  2  3  4 
## 20 39 54 81
sort(table(ns100$cluster))
## 
##  4  1  3  2 
## 20 39 54 81
sort(table(ns200$cluster))
## 
##  4  3  2  1 
## 20 39 54 81
#In this particular example, we can see that using a nstart of 1 doesnt do a good job of finding the optimal groups, however after nstart=10, it looks to have found the best subset as it doesnt change when increasing the number of starting positions.

Problem 3: Hierarchical clustering (20 points)

Sub-problem 3a: hierachical clustering by different linkages (10 points)

Cluster country states in (scaled) world health statistics data using default (Euclidean) distance and “complete”, “average”, “single” and “ward” linkages in the call to hclust. Plot each clustering hierarchy, describe the differences. For comparison, plot results of clustering untransformed WHS data using default parameters (Euclidean distance, “complete” linkage) – discuss the impact of the scaling on the outcome of hierarchical clustering.

whs.scaled = scale(whsAnnBdatNum)
par(mfrow=c(1,1))
for (method in  c("complete", "average", "single", "ward.D")){
  clust <- hclust(dist(whs.scaled), method=method)
  plot(clust, cex=.5,xlab="", ylab="",main = paste0("Dendogram: ", method))
}

#All dendograms, with the exception of the Ward method, have India as a leaf singleton. The single method seems quite hard to interpret as the distances between branches is mostly uniform. Both the average and complete methods show a stronger bushier heirachy.   

unscaled.clust = hclust(dist(whsAnnBdatNum))
plot(unscaled.clust,cex=.5,xlab="", ylab="",main = "Unscaled Dendogram")

# The difference that jumps out in the unscaled clustering is how strongly it is affected by the outlier India. The heirachy also seems pretty flat after 4 clusters have been identified. The height of the tree is most likely being skewed by this.

Sub-problem 3b: compare k-means and hierarchical clustering (5 points)

Using function cutree on the output of hclust determine assignment of the countries in WHS dataset into top four clusters when using Euclidean distance and “complete” linkage. Use function table to compare membership of these clusters to those produced by k-means clustering with four clusters in the Problem 2(c) above. Discuss the results.

hclust(dist(whs.scaled))
## 
## Call:
## hclust(d = dist(whs.scaled))
## 
## Cluster method   : complete 
## Distance         : euclidean 
## Number of objects: 194
hclust4 <- cutree(hclust(dist(whs.scaled)),4)
kmeans4 <- kmeans(pc.scale$x[,1:2],4)

table(hclust4)
## hclust4
##   1   2   3   4 
##  55 137   1   1
table(kmeans4$cluster)
## 
##  1  2  3  4 
## 38 66 54 36
#The heirachical clustering was performed on the 2 first principal components
#whereas the kmeans was done on the scaled data
#The Heirachical clustering has produced 2 singletons on a cluster of 4
#While the kmeans has produced 4 rather similar sized clusters


hclust.pc4 <- cutree(hclust(dist(pc.scale$x[,1:2])),4)
table(hclust.pc4,kmeans4$cluster)
##           
## hclust.pc4  1  2  3  4
##          1 22  0  0 22
##          2  0 10 52  1
##          3  0 56  2 13
##          4 16  0  0  0
#Indeed, if we compare against the hclust done on the 2 Principal components we get more similar results

Sub-problem 3c: cluster variables by correlation (5 points)

Use (casted as distance) one-complement of Spearman correlation between attributes in world health statistics dataset to cluster attributes of WHS dataset. E.g. hclust(as.dist(1-cor(xyz,method="spearman"))) would cluster columns (as opposed to rows) in the matrix xyz. Plot the results – which variables tend to cluster together, why do you think that is? Compare results obtained by this approach for scaled and untransformed WHS dataset? How do they compare? What do you think is the explanation?

sp.clust <- hclust(as.dist(1-cor(whsAnnBdatNum,method="spearman")))
plot(sp.clust, cex=.5)

sp.clust.scaled <- hclust(as.dist(1-cor(whs.scaled,method="spearman")))
plot(sp.clust.scaled, cex=.5)

range(as.dist(1-cor(whsAnnBdatNum,method="spearman")))
## [1] 0.009313115 1.917612409
range(as.dist(1-cor(whs.scaled,method="spearman")))
## [1] 0.009313115 1.917612409
#This is due to the correlation calculated on both the scaled and unscaled data being the same. 
#As part of the correlation formula involves scaling the variables ((x-x_bar)/sigma_x)..., the difference in scale will not affect the correlation matrix.
#

For extra 4 points

Use contingency tables to compare cluster memberships for several top clusters across different choices of linkage (e.g. “complete”,“ward”,“single”) and distance (Euclidean, Manhattan, one-complement of correlation coefficient). Discuss the results.

table(cutree(hclust(dist(whs.scaled),method="complete"),4))
## 
##   1   2   3   4 
##  55 137   1   1
table(cutree(hclust(dist(whs.scaled),method="average"),4))
## 
##   1   2   3   4 
## 191   1   1   1
table(cutree(hclust(dist(whs.scaled),method="single"),4))
## 
##   1   2   3   4 
## 191   1   1   1
table(cutree(hclust(dist(whs.scaled),method="complete"),6))
## 
##   1   2   3   4   5   6 
##  55 135   1   1   1   1
table(cutree(hclust(dist(whs.scaled),method="average"),6))
## 
##   1   2   3   4   5   6 
##  50 140   1   1   1   1
table(cutree(hclust(dist(whs.scaled),method="single"),6))
## 
##   1   2   3   4   5   6 
## 189   1   1   1   1   1
#When we are only picking the top 4, we can see that the single and average methods produce the same result for this data. However, changing the cut to 6, shows that the complete and average methods are more similar

table(cutree(hclust(dist(whs.scaled,method="manhattan")),6))
## 
##   1   2   3   4   5   6 
##  39 120  27   5   1   2
table(cutree(hclust(dist(whs.scaled,method="euclidean")),6))
## 
##   1   2   3   4   5   6 
##  55 135   1   1   1   1
table(cutree(hclust(dist(whs.scaled,method="maximum")),6))
## 
##   1   2   3   4   5   6 
## 189   1   1   1   1   1
#The euclidean in this example creates more singletons than the manhattan, this may be due to clustering at the high level it is distinguishing based on relative position (left or right), whereas the manhattan, being performed on the absolute distance has a more shallow tree. The maximum distance method, in this case has clusterd all six into the same group.

Appendix: pre-processing of WHS data

For your reference only – the file it generated is already available on course website

whsAnnBdat <- read.table("../data/whs2016_AnnexB-data.txt",sep="\t",header=T,as.is=T,quote="")
dim(whsAnnBdat)
whsAnnBdat <- apply(whsAnnBdat,2,function(x)gsub(">","",gsub("<","",gsub(" ","",x))))
whsAnnBdat <- apply(whsAnnBdat,2,function(x){x[x==rawToChar(as.raw(150))]<-"";x})
rownames(whsAnnBdat) <- whsAnnBdat[,1]
whsAnnBdat <- whsAnnBdat[,-1]
whsAnnBdatNum <- apply(whsAnnBdat,2,as.numeric)
whsAnnBdatNum <- apply(whsAnnBdatNum,2,function(x){x[is.na(x)] <- mean(x,na.rm = TRUE);x})
rownames(whsAnnBdatNum) <- rownames(whsAnnBdat)
write.table(whsAnnBdatNum,"../data/whs2016_AnnexB-data-wo-NAs.txt",quote=F,sep="\t")